HnswDensevector SafeTensor Generator #2515
Setup for NFCorpus Indexing with Safetensors
To efficiently perform NFCorpus indexing using Safetensors, follow this setup workflow:
Indexing Procedure
To build HNSWSafetensors indexes, use the following sample command:
Ensure all paths and parameters are adjusted according to your setup and directory structure. |
Can you make the safetensors collection go into We also shouldn't need a new indexer. The indexing command should be similar to https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus-bge-base-en-v1.5-hnsw.md e.g.,
With the only exception being a different |
Updated Workflow for Safetensors Conversion and Indexing Process
|
...in/java/io/anserini/index/generator/HnswJsonWithSafeTensorsDenseVectorDocumentGenerator.java
Updates
Updated commands
Python:
python src/main/python/safetensors/json_to_bin.py --input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl --output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
Java:
bin/run.sh io.anserini.index.IndexHnswDenseVectors -collection JsonDenseVectorCollection -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &
|
62cd3c7 to ff75047
Updated command:
|
src/main/java/io/anserini/collection/SafeTensorsDenseVectorCollection.java
...ava/io/anserini/index/generator/HnswJsonWithSafeTensorsDenseVectorDocumentGeneratorTest.java
@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case. For example, for
|
@Panizghi on your branch, running:
Works fine. However, I would like some progress indication... e.g., using tqdm? Also, what do I do if there is more than one vector part? However, more compact, as expected, which is good.
|
Running indexing command:
Something's not right... get an exception:
|
Yes, that is correct. Per the earlier discussion, we kept it only for NFCorpus with a single file. I will update the code to handle multiple files. |
This should be fixed now and work with the same command |
tqdm is added. There is also an --overwrite argument, which you can use if the file already exists. Command:
Sample output:
For vector parts, are we considering a case like this?
|
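The multi-part question above can be sketched as follows. This is only an illustration under stated assumptions, not the code on the branch: the helper names `select_vector_parts` and `find_vector_parts` are hypothetical, and the `.jsonl` / `.jsonl.gz` suffixes are assumed from the `vectors.part00.jsonl` file name used in the commands in this thread.

```python
import os

# Assumed suffixes for vector part files (hypothetical; based on the
# vectors.part00.jsonl name that appears in the commands in this thread).
VECTOR_SUFFIXES = (".jsonl", ".jsonl.gz")


def select_vector_parts(names):
    """Keep only vector part files and return them in a stable, sorted order."""
    return sorted(n for n in names if n.endswith(VECTOR_SUFFIXES))


def find_vector_parts(input_dir):
    """Return full paths of all vector parts in input_dir (not just one file)."""
    return [os.path.join(input_dir, n)
            for n in select_vector_parts(os.listdir(input_dir))]
```

Iterating over the result of `find_vector_parts(...)` instead of opening a single hard-coded file would let the same command cover both single-part and multi-part corpora.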
Please update the tools submodule in your branch to bring it up to date with master.
Okay, I can now run these commands:
After I build the index, I should be able to switch to retrieval, here: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus.bge-base-en-v1.5.hnsw.onnx.md The retrieval command is this:
But the eval command generates errors:
From here:
I appear to be getting duplicates of docs, e.g., |
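One way to confirm the duplicate-document symptom before running evaluation is to scan the run file directly. The sketch below assumes the standard six-column TREC run format (`qid Q0 docid rank score tag`); the function name `find_duplicate_hits` and the sample ids in the usage note are hypothetical.

```python
from collections import Counter


def find_duplicate_hits(run_lines):
    """Return {(qid, docid): count} for every pair that occurs more than once."""
    counts = Counter()
    for line in run_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, _score, _tag = fields
        counts[(qid, docid)] += 1
    return {pair: n for pair, n in counts.items() if n > 1}
```

For example, passing the lines `q1 Q0 docA 1 0.9 run` and `q1 Q0 docA 2 0.8 run` reports `("q1", "docA")` with a count of 2. An empty result suggests the index did not emit the same document twice for any query.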
That was initially the reason I swapped to a single thread with a critical section. Testing the fix right now. |
Updated command:
Indexing Performance:
File Sizes:
|
Superseded by #2582 |
Linked issue : castorini/ura-projects#31 (comment)
@17Melissa will provide the flow command below :)